Importing and loading the necessary packages for the analysis.
if(!require("install.load")){
install.packages("install.load")
}
library('install.load')
install_load("readr", "readxl", "tidyr", "dplyr", "stringr", "tidyverse", "knitr", "rmarkdown", "ggplot2","lubridate", "fitdistrplus","stats4","MASS", "logspline", "gamlss","gamlss.add", "gamlss.dist","DiagrammeR", "plotly","kableExtra")Create a distribution for the logistics delay of component „K7”. Use the production date (“Produktionsdatum”) from the data set “Komponente_K7.csv” and the receiving date of incoming goods (“Wareneingang”) from “Logistikverzug_K7.csv” (logistics delay). You can assume that produced goods are issued one day after production date. For the model design in R, create a new data set “Logistics delay” that contains the required information from both data sets.
#import the relevant tables
comp_7 <- read_csv2("Data/Logistikverzug/Komponente_k7.csv")
log_delay_7 <- read_csv("Data/Logistikverzug/Logistikverzug_K7.csv")
#check if there are NA values
sum(is.na(comp_7))## [1] 0
sum(is.na(log_delay_7))## [1] 0
#create the joined dataset
logistics_delay <- comp_7 %>%
inner_join(log_delay_7, by = c("IDNummer"), suffix = c("_comp", "_log")) %>% #join the two ds through the ID number column
dplyr::select(Produktionsdatum, Wareneingang) %>% #keep only the columns that are important for the analysis
mutate(Produktionsdatum=as.Date(Produktionsdatum), Wareneingang=as.Date(Wareneingang)) %>% #turn the values into date
mutate(issuing_date=(Produktionsdatum+1)) %>% #goods are issued one day after their production
mutate(delay=as.integer(Wareneingang-issuing_date)) %>% #create a column "delay" with the difference btw receipt and production
mutate(weekday_issuing=wday(issuing_date-1), weekday_waren=wday(Wareneingang-1)) # %>% #-1 bc, after checking online, the calculated weedkday was wrong by 1
rm(comp_7) #remove the old datasets from the global environment
rm(log_delay_7)
#create a new csv file called "Logistics Delay" in the Logistikverzug folder
write.csv2(logistics_delay, "Data/Logistikverzug/Logistics Delay.csv", row.names=FALSE)How is the logistics delay distributed? Justify your choice with statistical tests and briefly describe your approach.
The distribution analysis will be performed following an iterative process. The CUllen & Frey graphs for both the discrete and the continuous case will give information about the possible best-fitting theoretical models. FUrthermore, the analysis will be based on a graphical comparison and on goodness of fitting parameters.
library(fitdistrplus)
#create a variable x that calls the "delay" variable of the new dataset
x <- logistics_delay$delay
#Result summary of the parameters of our model, to start the model distribution analysis
summary(x)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 6.08 7.00 14.00
The distribution is discrete, because it is based on single days of delay, without continuity. The blue point, called “observation”, gives the values of Kurtosis and Skeweness of the delay model. The discrete theoretical distributions that will be compared with the empirical one are the Normal, the Negative binomial and the Poisson.
descdist( data = x , discrete = T)## summary statistics
## ------
## min: 3 max: 14
## median: 6
## mean: 6.080437
## estimated sd: 1.012302
## estimated skewness: 0.5674067
## estimated kurtosis: 3.630055
#Comparison with the normal distribution
norm_x <- fitdist(x,"norm", discrete=TRUE)
plot(norm_x)#Results of the norm distribution
summary(norm_x)## Fitting of the distribution ' norm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## mean 6.080437 0.001828526
## sd 1.012300 0.001292957
## Loglikelihood: -438637.3 AIC: 877278.6 BIC: 877299.8
## Correlation matrix:
## mean sd
## mean 1.00000e+00 1.20413e-10
## sd 1.20413e-10 1.00000e+00
#Comparison with the negative binomial distribution
nbinom_x <- fitdist(x,"nbinom", discrete=TRUE)
plot(nbinom_x)#Results of the negative binomial distribution
summary(nbinom_x)## Fitting of the distribution ' nbinom ' by maximum likelihood
## Parameters :
## estimate Std. Error
## size 7.103259e+07 33.84264490
## mu 6.080529e+00 0.00445416
## Loglikelihood: -585889.5 AIC: 1171783 BIC: 1171804
## Correlation matrix:
## size mu
## size 1.000000e+00 3.290346e-05
## mu 3.290346e-05 1.000000e+00
#Comparison with the Poisson distribution
pois_x <- fitdist(x,"pois", discrete=TRUE)
plot(pois_x)#Results of the poisson distribution
summary(pois_x)## Fitting of the distribution ' pois ' by maximum likelihood
## Parameters :
## estimate Std. Error
## lambda 6.080437 0.004454092
## Loglikelihood: -585889.5 AIC: 1171781 BIC: 1171792
The empirical model is better described by the normal distribution. Indeed, for both Poisson’s and negative binomial’s distributions, a non-negligible difference can be noticed from the density and cumulative distribution function graphs.
A second approach in finding the best fitting model is conducted with the continuous distributions, in a discretized form: using bins to represent the continuous models. In this case, the most promising models, offered by the Cullen & Frey diagram, are the Weibull, log-normal, gamma and beta.
The beta model is immediately rejected, because all the values of this type of distribution must be between 0 and 1.
# Continuous case: discrete is false by default
descdist(x)## summary statistics
## ------
## min: 3 max: 14
## median: 6
## mean: 6.080437
## estimated sd: 1.012302
## estimated skewness: 0.5674067
## estimated kurtosis: 3.630055
#Comparison with the log-normal distribution
lnorm_x <- fitdist(x,"lnorm", discrete=TRUE)
plot(lnorm_x)#Results of the log-normal distribution
summary(lnorm_x)## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 1.7914898 0.0002973244
## sdlog 0.1646034 0.0002102052
## Loglikelihood: -430989.8 AIC: 861983.7 BIC: 862004.9
## Correlation matrix:
## meanlog sdlog
## meanlog 1.000000e+00 -3.183187e-12
## sdlog -3.183187e-12 1.000000e+00
#Comparison with the gamma distribution
gamma_x <- fitdist(x,"gamma", discrete=TRUE)
plot(gamma_x)#Results of the gamma distribution
summary(gamma_x)## Fitting of the distribution ' gamma ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 36.964428 0.09400297
## rate 6.079205 0.01556496
## Loglikelihood: -432129 AIC: 864262 BIC: 864283.3
## Correlation matrix:
## shape rate
## shape 1.0000000 0.9932448
## rate 0.9932448 1.0000000
#Comparison with the weibull distribution
weib_x <- fitdist(x,"weibull", discrete=TRUE)
plot(weib_x)#Results of the weibull distribution
summary(weib_x)## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 5.979374 0.007563124
## scale 6.515010 0.002087809
## Loglikelihood: -458880.8 AIC: 917765.6 BIC: 917786.9
## Correlation matrix:
## shape scale
## shape 1.000000 0.333769
## scale 0.333769 1.000000
Following the discrete approach, the normal distribution is the one that best fits the data of the case study. Whereas, by extending the analysis to the continuous cases, the best-fitting model is the log-normal. This can be noticed by simply looking at the graphs plotted with the fitdist function. Nonetheless, relatively good results are achieved with the gamma distribution too. Further analysis can be developed through the goodness of fit criteria.
A low value of chi-squared statistics suggests a high correlation between the two distributions.
In this case, all the values are extremely high: much bigger than what is considered acceptable by the critical table. In particular, gamma has the lowest value and log-norm the highest one. Nevertheless, a chi-squared p-value of 0 for all three cases indicates that the null hypothesis should be rejected, and an alternative one should be followed. Hence, the wrong results of the chi-squared values are confirmed false by the p-values.
The Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are mathematical criteria for model selection, and will be used for the analysis. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model. Th BIC is an increasing function of the error variance. Therefore, the model is preferred when the values of AIC and BIC are low.
This confirms the better fitting of the log-Normal model, which has the lowest values of AIC and BIC.
#Goodness-of-fitting parameters analysis - norm, log-norm and gamma distributions
gofstat(list(norm_x,lnorm_x, gamma_x),fitnames = c("Normal", "log-Normal", "Gamma"))## Chi-squared statistic: 90595.99 92684.71 87044.54
## Degree of freedom of the Chi-squared distribution: 4 4 4
## Chi-squared p-value: 0 0 0
## Chi-squared table:
## obscounts theo Normal theo log-Normal theo Gamma
## <= 4 8518 6108.9169 2119.385 3087.668
## <= 5 81657 37693.4784 39061.370 38746.042
## <= 6 124094 99737.1964 112264.550 108244.993
## <= 7 67161 107219.4836 99688.691 102490.798
## <= 8 19973 46853.6953 41061.533 42944.490
## <= 9 4260 8275.6701 10194.366 9562.592
## > 9 827 601.5594 2100.105 1413.417
##
## Goodness-of-fit criteria
## Normal log-Normal Gamma
## Akaike's Information Criterion 877278.6 861983.7 864262.0
## Bayesian Information Criterion 877299.8 862004.9 864283.3
Determine the mean of the logistics delay (watch out for weekends). Please interpret this number and discuss possible alternatives.
The weekends will not interfere with the calculation of the logistics delay, because - as it can be noticed from the logistics delay dataset - the operations of production, issuing and reception of goods are performed in all the days of the week, weekends included.
mean_value <- mean(logistics_delay$delay) #calculate the mean
print(paste("The mean of the delay is:", mean_value))## [1] "The mean of the delay is: 6.08043655584195"
#double check that the operations are conducted also in the weekend
weekend_days <- c(6,7)
weekend_days %in% logistics_delay$weekday_issuing## [1] TRUE TRUE
weekend_days %in% logistics_delay$weekday_waren## [1] TRUE TRUE
#calculate the mean in the case there were no operations in the weekend
logistics_delay_Noweekend <- logistics_delay %>%
filter(weekday_issuing != 6 &
weekday_issuing != 7)%>%
filter(weekday_waren != 6 &
weekday_waren != 7)
sort(unique(logistics_delay_Noweekend$weekday_issuing))## [1] 1 2 3 4 5
sort(unique(logistics_delay_Noweekend$weekday_waren))## [1] 1 2 3 4 5
meanNoWeekend <- mean(logistics_delay_Noweekend$delay)
print(paste("Mean without operations in the weekend:", meanNoWeekend))## [1] "Mean without operations in the weekend: 6.20806431822564"
#The mean value of the delay is higher, as expected. Visualize the distribution in an appropriate way by displaying the histogram and the density function using “plotly”. Please describe how you selected the size of the bins.
The distribution is plotted both with the plot_ly and the ggplotly functions.
The selected size of the bin is 1. This size allows to have a complete visualization of the values, without interruption on the x-axis. In the ggplot case, the density curve has been modified with the “adjust” argument of geom_density, which is a multiplicate bandwidth adjustment. This allows to keep a binwidth of 1, thus an appropriate data visualization.
#Plotting histogram and density with the plot_ly function
#library(plotly)
density <- density(x) #generate a variable to call the function density() in the plot
plot_ly(x = x) %>%
add_histogram(name = "Histogram") %>% #create the histogram and add the density curve
add_lines(x = density$x, y = density$y, fill = "tozeroy", yaxis = "y2", name = "Density") %>% #"tozeroy" to fill the plot area; "yaxis" adds a second y-axis
layout(yaxis2 = list(overlaying = "y", side = "right")) #"overlaying" allows to have the overlay of the two plots; "side" moves the 2nd y-axis reference to the right At first, the graph has been plotted with a binwidth of 0.185, in order to have a match between the peak values of the histogram and the ones of the density.
#Plot histogram and desnity with ggplot
b <- ggplot(logistics_delay, aes(x = x)) +
geom_histogram(aes(y=..density..),
binwidth=0.185, fill="blue", color="black") +
scale_x_continuous(breaks = seq(0,15,1)) + #to have all the integers from o to 13 on the x-axis
geom_density(col="red") +
labs(title = "Logistics delay", x = "Days of delay", y = "Density") #add the labels
ggplotly(b) #convert a ggplot2 to a plotly objectAfterwards, an adjustment value has been used, in order to have a continuous representation of the density function, over the histogram.
#Plotting histogram and density with ggplot
a <- ggplot(logistics_delay, aes(x = x)) +
geom_histogram(aes(y=..density..),
binwidth=1, fill="blue", color="black") +
scale_x_continuous(breaks = seq(0,15,1)) + #to have all the integers from o to 13 on the x-axis
geom_density(col="red", adjust=5.9) +
labs(title = "Logistics delay", x = "Days of delay", y = "Density") #add the labels
ggplotly(a) #convert a ggplot2 to a plotly objectPlease describe how you proceed, if you have to create a decision tree, which is describing the classification problem to classify whether the component (K7) is defective (Fehlerhaft) or not?
The definition of a decision tree is based on a classification problem. Therefore, it represents the classic work environment in which supervised learning is exploited.
The goal is to train a decision tree that predicts whether the component K7 is defective or not, depending on some specific characteristics (attributes).
Origin has only 1 value, thus it is not an informative attribute. Whereas, maunf_num and work_num are strictly related, hence only one of them can be exploited. The decision tree will be based on the manufacturer (Herstellernummer), and on the production date origin. However, the latter can assume 2924 values: too many to analyze. Therefore, the manufacturer number will be the only root node in the following simplified decision tree.
The probability of having a defective component is slightly higher for body parts produced by the manufacturer 112: 10.02447% vs the 10.02251% of manufacturer 114. However, the probability of defectiveness is relatively low for both the cases. Hence, each case should be investigated individually, through the production date origin.
#Import the dataset
#create fz to properly read the txt file
read_txt <- function(path, replacement){
path %>%
read_file() %>%
str_replace_all(replacement) %>%
str_replace_all(c("\n\"\\d+\"," = "\n", "\nNA," = "\n")) %>%
read_delim(show_col_types = FALSE)
}
#Load K7
comp_k7 <- read_txt("Data/Komponente/Komponente_K7.txt",
replacement = c("\t" = ",", "\r" = "\n")) %>%
dplyr::select(-Fehlerhaft_Datum, -Fehlerhaft_Fahrleistung, -X1) #remove date and km at which defectiveness was found#list all the attributes values
manuf_num <- comp_k7$Herstellernummer %>% unique()
work_num <- comp_k7$Werksnummer %>% unique()
prod_date <- comp_k7$Produktionsdatum_Origin_01011970 %>% unique()
origin <- comp_k7$origin %>% unique()
#filter for the two values of interest
comp_k7_114 <- comp_k7 %>% filter(Herstellernummer == manuf_num[1])
comp_k7_112 <- comp_k7 %>% filter(Herstellernummer == manuf_num[2])
#sum the values of the defectiveness column to find their number, because they only assume values of 0 or 1.
yes_112 <- as.integer(sum(comp_k7_112$Fehlerhaft)) #number of defectiveness for manuf 112
yes_114 <- as.integer(sum(comp_k7_114$Fehlerhaft)) #number of defectiveness for manuf 114
#an alternative would be to filter for Fehlerhaft == 1 and then count the rows of the filtered ds
tot_cases <- nrow(comp_k7_112) #same for both 112 and 114
rate_112 <- yes_112/tot_cases
rate_114 <- yes_114/tot_cases
print(paste("The rate of defective components with manufacturer number 112 is:", rate_112))## [1] "The rate of defective components with manufacturer number 112 is: 0.100244706189435"
print(paste("The rate of defective components with manufacturer number 114 is:", rate_114))## [1] "The rate of defective components with manufacturer number 114 is: 0.10022512969428"
#crate the flowchart
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
# edge definitions with the node IDs
tab1 -> tab2;
tab2 -> tab3 -> tab6
tab2 -> tab4 -> tab5
}
[1]: 'Is the component defective?'
[2]: 'Manufacturer number'
[3]: '114'
[4]: '112'
[5]: 'yes/no'
[6]: 'yes/no'
")Why does it make sense to store the available data in separate files instead of saving everything in a huge table? Name at least four benefits.
The available tables represent a typical data base structure. How is it called?
The database structure of the available tables is a multi-file relational (or structural) database. The data are divided into many tables, that are linked one to another through some specific keys. For instance, the datasets “Bestandteile_Fahrzeug_ID” relate to “Fahrzeug_ID”, through the key variable “ID_Fahrzeug”. The programming interface for database interaction, used in this type of database structure, is the Structured Query Language (SQL). The alternative database structure is called “single file”, and all the contained data are stored in a single flat file.
How many of the components T16 ended up in vehicles registered in Adelshofen?
The dataset containing the information of part 16, and the dataset of the components in which part 16 was installed were manually imported and cleaned.The latter were filtered and merged, to create a final dataset with all the IDs of the components in which part 16 was present. Later, these IDs have been used to filter the datasets of the four car types and find the IDs of the cars that contain those specific components (thus, part 16).
Finally, a new dataset with the vehicle IDs has been created and exploited to filter the registration file, containing all the IDs of the car registered in Adhelshofen. It was therefore possible to complete the calculation required by the task.
#Loading the functions to clean the txt file
Drop_Cols <- function(df){
if('V1' %in% names(df)) {df <- df %>% dplyr::select(-V1)}
if('X1' %in% names(df)) {df <- df %>% dplyr::select(-X1)}
if('X' %in% names(df)) {df <- df %>% dplyr::select(-X)}
if('...1' %in% names(df)) {df <- df %>% dplyr::select(-...1)}
return(df)
}
#Function to fix unstructured Data
Fix_unstruct_dat <- function(data){
#If the dataset has 14 columns we must convert it to only 7 columns:
if (ncol(data) == 14) {
df1 <- data[,1:7]
names(df1) <- sub("\\.x", "", names(df1))
df2 <- data[,8:14]
names(df2) <- sub("\\.y", "", names(df2))
#combine dataframes
df <- dplyr::bind_rows(df1, df2) %>% distinct() %>% drop_na(Werksnummer)
return(df)
#Same logic as in the second case when it has 21:
} else if(ncol(data) == 21) {
df1 <- data[,1:7]
names(df1) <- sub("\\.x", "", names(df1))
df2 <- data[,8:14]
names(df2) <- sub("\\.y", "", names(df2))
df3 <- data[,15:21]
# Combine Datasets
df <- dplyr::bind_rows(df1, df2, df3) %>% distinct() %>% drop_na(Werksnummer)
return(df)
} else {
print("Data cannot be processed with this function yet, prior changes are required!")
}
}#Import and clean the dataset of Part 16
P_16 <- read_txt("Data/Einzelteil/Einzelteil_T16.txt",
replacement = c(" \\| \\| " = ",", "\t" = "\n"))
P_16 <- P_16 %>% Drop_Cols() %>% Fix_unstruct_dat()#Import the files with the components in which the part 16 is installed
Bes_Komp_K2ST2 <- read_csv2("Data/Komponente/Bestandteile_Komponente_K2ST2.csv")
Bes_Komp_K2LE2 <- read_csv2("Data/Komponente/Bestandteile_Komponente_K2LE2.csv")
header_K2ST2 <- colnames(Bes_Komp_K2ST2) #look for the columns names of the dataset, to check the presence of part 16
header_K2LE2 <- colnames(Bes_Komp_K2LE2) #look for the columns names of the dataset, to check the presence of part 16
Bes_Komp_K2ST2 <- Bes_Komp_K2ST2 %>% dplyr::select(c("ID_T16", "ID_K2ST2")) #keep only the useful variables
Bes_Komp_K2LE2 <- Bes_Komp_K2LE2 %>% dplyr::select(c("ID_T16", "ID_K2LE2"))
#Merge the two datasets and create a variable ID_Comp that contains all the IDs of the components in which part 16 is installed
final_Best_Komp <- dplyr::bind_rows(Bes_Komp_K2LE2, Bes_Komp_K2ST2) %>% #merge the datasets combining the rows
pivot_longer(cols=c("ID_K2ST2", "ID_K2LE2"), values_to = "ID_Comp") %>% #merge the two variables into a new variable ID_Comp
drop_na() %>% #remove NA values
dplyr::select(-name) #remove name variable
#Import the 4 car types and apply filtering to know if they contain the components made with part 16
type11 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ11.csv")# %>%
#The components made with part 16 are seats -> ID_Sitze variable will be investigated
type11_filt <- final_Best_Komp %>%
left_join(type11, by = c("ID_Comp" = "ID_Sitze")) %>% #filter for the components that contain part 16
drop_na() %>% #remove NA values
dplyr::select("ID_Comp","ID_Fahrzeug") #keep only the important columns
#Apply the same procedure to the other car types
type12 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv")
type12_filt <- final_Best_Komp %>% left_join(type12, by = c("ID_Comp" = "ID_Sitze")) %>% drop_na() %>% dplyr::select("ID_Comp","ID_Fahrzeug")
type21 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv")
type21_filt <- final_Best_Komp %>% left_join(type21, by = c("ID_Comp" = "ID_Sitze")) %>% drop_na() %>% dplyr::select("ID_Comp","ID_Fahrzeug")
type22 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv") #%>%
type22_filt <- final_Best_Komp %>% left_join(type22, by = c("ID_Comp" = "ID_Sitze")) %>% drop_na() %>% dplyr::select("ID_Comp","ID_Fahrzeug")
#Merge all the IDs of the cars that contain part 16
cars_with_p16 <- dplyr::bind_rows(type11_filt, type12_filt, type21_filt, type22_filt) %>% #merge the datasets combining the rows
dplyr::select("ID_Fahrzeug") %>% #keep only the car ID variable
distinct() #check that there are not duplicated rows
#Import the final dataset
registration_ds <- read_csv2("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv")
#Merge the two datasets
Adelshofen_cars_with_p16<- cars_with_p16 %>%
left_join(registration_ds, by = c("ID_Fahrzeug" = "IDNummer")) %>% #merge the datasets, filtering for the IDs of the car containing part 16
filter(Gemeinden == "ADELSHOFEN") %>% #filter for vehicles registered in Adelshofen
dplyr::select(-...1, -Zulassung) %>% distinct()
print(paste("The number of components T16 ended up in vehicles registered in Adelshofen is:", nrow(Adelshofen_cars_with_p16))) #convert into char and print## [1] "The number of components T16 ended up in vehicles registered in Adelshofen is: 48"
Which data types do the attributes of the registration table “Zulassungen_aller_Fahrzeuge” have? Put your answers into a table which is integrated into your Markdown document and describe the characteristics of the data type(s).
The data types of the dataset Zulassungen_alle_Fahrzeuge are: a numeric, two characters, and a date.
The numeric data type includes all the real numbers, with or without decimal values. The character data type is used to specify character or string values in a variable. The date data type dates is stored internally as the number of days or seconds from some reference date.
Zulassungen_alle_Fahrzeuge <- read_csv2("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv")
# create matrix with 4 columns and 2 rows
data= matrix(c(class(1), class("12-1-12-1"), class("LEIPZIG"), "date", 1, "12-1-12-1", "LEIPZIG", "2009-01-01"), ncol=4, byrow=TRUE)
# specify the column names and row names of matrix
colnames(data) = colnames(Zulassungen_alle_Fahrzeuge)
rownames(data) <- c('Data type','Ex. value')
final=as.table(data) #convert the matrix into a table
kable(final)%>%
kable_styling(bootstrap_options = "striped",full_width = F, position = "center") %>% column_spec(1:ncol(final), width_max = "12em", width_min = "8em")#show the table| …1 | IDNummer | Gemeinden | Zulassung | |
|---|---|---|---|---|
| Data type | numeric | character | character | date |
| Ex. value | 1 | 12-1-12-1 | LEIPZIG | 2009-01-01 |
You want to publish your application. Why does it make sense to store the records on the database of a server? Why can’t you store the records on your personal computer?
What is an easy way to make your application available to your customers? Please name 4 aspects.
On 11.08.2010 there was a hit and run accident. There is no trace of the license plate of the car involved in the accident. The police asks for your help, as you work for the Federal Motor Transport Authority, and asks where the vehicle with the body part number “K5-112-1122-79” was registered.
The data sets of the 4 possible car types (type11, type12, type21, and type 22) have been imported, and filtered to find where the vehicle with the body part K5-112-1122-79 comes from. Once the ID of that vehicle has been found, it was possible to look in the registration dataset and filter its origin.
#import the datasets of the 4 vehicle types and filter them to look for the right car body number
car_type11 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ11.csv") %>%
filter(ID_Karosserie == "K5-112-1122-79")
car_type21 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv") %>%
filter(ID_Karosserie == "K5-112-1122-79")
car_type22 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv") %>%
filter(ID_Karosserie == "K5-112-1122-79")
#contains the data of the vehicle we are looking for
car_type12 <- read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv") %>%
filter(ID_Karosserie == "K5-112-1122-79")Conclude the analysis
#Move to the second dataset; filter it for the ID_Fahrzeug previously found
typ_12 <- read_csv2("Data/Fahrzeug/Fahrzeuge_OEM1_Typ12.csv") %>%
filter(ID_Fahrzeug == "12-1-12-82")
#check variables present in both datasets
typ_12[, colnames(typ_12) %in% colnames(car_type12)]## # A tibble: 1 × 2
## ...1 ID_Fahrzeug
## <dbl> <chr>
## 1 82 12-1-12-82
typ_12[, colnames(typ_12) %in% colnames(Zulassungen_alle_Fahrzeuge)]## # A tibble: 1 × 1
## ...1
## <dbl>
## 1 82
#Move to the final dataset: filter for the IDNummer previously found
Registr_veh <- Zulassungen_alle_Fahrzeuge %>%
filter(IDNummer=="12-1-12-82") %>%
dplyr::select(Gemeinden) #select only the column of interest
print(paste("The damaged vehicle was registred in", Registr_veh[,])) #paste converts into char and concatenates the vectors## [1] "The damaged vehicle was registred in ASCHERSLEBEN"